Predicting Capital Bikes Departures through Supervised Machine Learning

DC’s Capital Bikeshare service has been increasing in popularity, especially during the COVID-19 pandemic, when Washingtonians needed alternatives for public transport. To date it has 5,000 bikes and 600+ stations across 7 jurisdictions. However, users find the service unreliable at times, especially at peak times. In this project, our goal is to use and optimize supervised machine learning models that can predict the number of ride-sharing bikes that will be used at any give hour. For simplicity, we apply our models to one station in particular that has particularly high demand due to its proximity to DC’s greatest tourist attraction: the Lincoln Memorial station. As such, our target variable will be the number of bikes that departed from that station at a given hour.

We chose predictors that vary by the hour that we believe are relevant to individuals’ choices of taking a capital Bikeshare bike. They are of three types:

Being able to predict Capital Bikeshare demand, could result in a more efficient allocation of bikes when stations are re-stocked at night. It could also inform and it could inform the eventual expansion of stations across strategic locations across the city to improve the experience of Washingtonians.

Data

We use Capital Bikeshare’s publicly available historic data, ranging from May 2020 until September 2021. The reason we chose this range is that before May 2020, the data were collected in a different format.

Our unit of observation is departures by the hour. So we would expect 24 rows per day. Since we have 17 months worth of data, we would expect roughly 365x24 + 150x24 = 12334 observations. In fact, after data cleaning, our dataset has 12334 rows, each indicating the number of hours that departed Lincoln Memorial at a given hour.

For weather predictors we used data from visual crossing

For sunlight we used data from the suncalc package.

Error Metric Choice

We will use RMSE as our error metric to judge how well our models do. RMSE is very sensitive to outliers as it squares them resulting in a bigger error compared to MAE for example. From the exploration done above, we know that there is a wide range within (e.g. in some hours it is 0, for example at night) and across days (e.g. for instance there are very few rides on Christmas day) in terms of hourly departures. Our motivation for this project is to reduce the chance that there will be no bikes available so that the service is more reliable. Therefore, we think that the RMSE will be better for our application purposes because it penalizes more the outliers that we do have in our data.

How much error is too much depends on how loose or strict we want to predict our outcomes. For our project, some error is probably accepted but being too loose in deciding what the error could be too costly. Based on the average departures per hour on any given day 3.45, we think an RMSE of 1-2 would be acceptable

Models performed

We ran three different models:

We start with Lasso given that we have many predictors and we were interested in variable selection and seeing the coefficients of the most important predictors. We then moved to non-parametric models, given that they tend to work well with large datasets like ours. We did a decision tree first and then tried a random forest, which is a more complex version of a decision tree and we expect to do better than the decision tree. Our best results came from the random forest model.